test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up) by wyuc · Pull Request #599 · THU-MAIC/OpenMAIC

wyuc · 2026-05-26T05:06:25Z

Closes #598 follow-up items 1 and 2 (and surfaces additional in-product fixes discovered during the prod investigation).

The root symptom this PR addresses: agents drift onto adjacent topics rather than answering the user's literal question ("答非所问"). The director then keeps picking peer agents for "variety" instead of routing back to the teacher to actually answer. Frustration signals from the user are just one sub-case where this is loud enough to notice; the underlying bug fires earlier, often on the first response.

What's in this PR

Commit 1: premature-END regression guard

5 scenarios modelled on #511's described symptoms. A/B across 9 models × 5 scenarios × 5 samples per variant: 0% END rate in both variants on every scenario. The prompt-layer rules added in #554 don't measurably change director END behavior on these inputs. Kept as a regression guard so any future change that raises the END rate trips this eval.

Commit 2: director question-answering eval

5 synthesized scenarios — 3 first-turn drift (no frustration yet), 2 with escalated frustration. Decision rule-judged: USER | TEACHER | OTHER_AGENT | END. A/B: current main vs main with rule 13 stripped.

Commit 3: director rule 13 — Answering the User's Question

Constrained-output rule injected at the end of the # Rules section. Triggers whenever the conversation has an unanswered user question. Constraints: route to LITERAL role: teacher by default, USER cue when ambiguous or after multiple unresolved complaints, forbid peer-agent "variety" routing and END while the question is open.

On google:gemini-3-flash-preview, 5 samples per variant:

Scenario	baseline	with_rule
math_quadratic_axis_drift_no_frustration	100%	100%
bio_dark_reaction_drift_no_frustration	80%	100%
english_team_isare_drift_no_frustration	20%	100%
physics_inertial_mass_escalated_frustration	0%	100%
calculus_product_rule_drift_no_frustration	60%	100%
mean	52%	100%

3 of 5 scenarios reproduce the bug at baseline (≤60% correct, peer-agent "variety" routing dominates). The rule lifts every scenario to 100%.

Commit 4: agent rule — Answering the User's Question

Adds a section to the agent system prompt that directs the agent, on every response where the user message contains a question, to lead with the concrete answer to the literal question before any other content. Pairs with the director rule — director routes to the teacher when the question is unanswered; the teacher now has explicit guidance for what to do when dispatched.

Commit 5: rip maxTurns end-to-end + distinguish errors from completion

maxTurns rip: the director graph used maxTurns: turnCount + 1 to force "one director→agent cycle per request" — a hack expressing the single-round contract as a turn limit. Replaced with explicit graph topology (agent_generate → END). The user-facing "Max Discussion Turns" setting is also removed: with rule-driven routing (cue_user / END), a numeric cap is redundant and was previously masking director bugs by force-ending.

Error vs completion: before, every non-cue_user outcome of the client agent loop mapped to status: 'completed', conflating end (legitimate completion) with empty_turns (parser failure / model timeout) and no_done (SSE infra failure). The new switch routes the error paths through clearLiveSessionAfterError, which appends an inline System message and sets the new status: 'error'. The session list gets a red AlertCircle icon for the error state.

Touches: director-graph.ts, lib/types/chat.ts, lib/store/settings.ts, components/agent/agent-bar.tsx, components/settings/agent-settings.tsx, lib/chat/agent-loop.ts, components/chat/use-chat-sessions.ts, components/chat/session-list.tsx, eval/whiteboard-layout/runner.ts, 6 i18n locales.

Test plan

pnpm check clean
pnpm test — no new failures (9 pre-existing ssrf-guard DNS-bound failures unrelated)
npx tsc --noEmit -p tsconfig.json clean
EVAL_DIRECTOR_MODEL=<model> pnpm eval:orchestration — premature-END eval passes (regression guard)
EVAL_DIRECTOR_MODEL=<model> npx tsx --env-file=.env.local eval/orchestration/answering-runner.ts — 5/5 scenarios PASS, mean 100% correct rate with rule, baseline 52%
dev server sanity (manual) — multi-agent discussion, user asks specific question → check teacher answers directly; agent failure to produce output → check session-list error icon

Adds eval/orchestration/ following the outline-language pattern: a runner that A/B-tests the director against the same scenario with two prompt variants — current main (post-#554) and a synthesised pre-#554 baseline that strips the role-aware summary labels AND the new system.md rules 10/11/12 together. Pass criterion is now framed as a regression guard rather than a fixture discrimination test: every scenario's post-fix END rate must stay below EVAL_END_THRESHOLD (default 20%). The pre-vs-post Δ is reported as informational data, since #554's reviewer feedback was that earlier fixtures didn't discriminate. Empirical finding across 7 shipped model configs (gpt-4.1-mini, gpt-4o-mini, gpt-5.4-nano, qwen-plus, qwen3.5-flash, deepseek-chat, deepseek-v4-flash, gemini-2.5-flash, claude-haiku-4-5) with 5 scenarios modelled on #511 (incl. the exact Tiananmen-3D objection trace, soft pushback after a long resolved-looking discussion, topic pivot, brief acknowledgement, and explicit teacher-signals-end-then-user-objects): no scenario produced a non-zero END rate in either variant. Useful data for the #554/#598 discussion — the prompt-layer rules don't measurably change behavior on these shipped models with these prompts, but the eval is now wired so future regressions show up.

…w-up) Adds an A/B eval probing whether the director routes correctly when the conversation contains an unanswered user question — both in the first-turn drift case (agents started answering but drifted onto adjacent topics, no frustration yet) and in the escalated frustration case (user has already complained multiple times). The prod symptom this targets ("答非所问"): agents reply on adjacent topics rather than answering the literal question, and the director keeps picking peer agents for "variety" instead of routing back to the teacher to actually answer. Scenarios: 5 synthesized cases across math, biology, English grammar, physics, and calculus. 3 are first-turn drift (no frustration yet), 2 include frustration signals. Each captures: user asks specific question, agents reply on adjacent topics, director is asked to pick the next move. Decision rule-judged into USER | TEACHER | OTHER_AGENT | END: - USER → ✓ valid (cue user to clarify) - TEACHER → ✓ valid (re-route to teacher to answer) - OTHER_AGENT → ✗ wrong (peer-agent "variety" routing — the bug) - END → ✗ wrong A/B: - baseline : current director template with rule 13 stripped - with_rule : current director template as-shipped Pass = with_rule.correctRate ≥ EVAL_PASS_THRESHOLD (default 0.7). Pre-vs-post Δ is reported as informational only. Note this commit only adds the eval; the rule it strips/restores in the A/B is injected in the next commit.

Adds rule 13 to lib/prompts/templates/director/system.md so the director recognizes when the conversation contains an unanswered user question and routes appropriately — to the teacher (literal role match) by default, or to USER cue when the question is ambiguous or the user has already complained multiple times. This addresses the "答非所问" symptom: when agents drift onto adjacent topics, the director keeps picking peer agents for "variety" rather than routing back to the teacher to actually answer. The rule is preventive (applies whenever the question is unanswered, including the first turn after drift) rather than reactive-only (waiting for the user to express frustration). Rule wording uses constrained-output framing: literal `role: teacher` match (not "best fit" / "highest priority" / "most knowledgeable"), and an explicit list of forbidden choices (peer agents, END). Earlier softer wordings let the model pick the assistant when it seemed more topically relevant; the literal-match wording prevents that. Validated by the question-answering eval added in the previous commit. On google:gemini-3-flash-preview, 5 samples per variant across 5 synthesized scenarios: | Scenario | baseline | with_rule | |----------------------------------------------------|----------|-----------| | math_quadratic_axis_drift_no_frustration | 100% | 100% | | bio_dark_reaction_drift_no_frustration | 80% | 100% | | english_team_isare_drift_no_frustration | 20% | 100% | | physics_inertial_mass_escalated_frustration | 0% | 100% | | calculus_product_rule_drift_no_frustration | 60% | 100% | | **mean** | **52%** | **100%** | 3 of 5 scenarios reproduce the bug at baseline (correct rate ≤ 60%); the rule lifts every scenario to 100%. The remaining 2 baselines are already high — the rule doesn't regress them.

Adds an "Answering the User's Question" section to the agent system prompt. Before this, the agent SP had no rule for direct-answering — the persona, role guideline, current-state context, and length guidelines all pushed the agent to advance the lesson and inspire thought, with no explicit directive to first answer the user's literal question. The result observed in prod ("答非所问"): when the user asks a specific question (a formula, a yes/no, a term), agents reply with adjacent concepts, examples, or follow-up questions — but never the literal answer. The new section directs the agent, on every response where the user message contains a question, to: - lead with the concrete answer to the literal question, - not pivot to an adjacent topic even when it seems pedagogically richer, - treat "inspire thought" and peer-differentiation guidance as applying only AFTER the literal question is answered, - say "I don't know" directly when uncertain rather than answering a different question, - on frustration signals from prior turns, look back at the original question (before the frustration) and answer THAT specifically. This is the agent-side counterpart to the director rule landed in the previous commit. Director routes to the teacher when the question is unanswered; the teacher now has explicit guidance for what to do when dispatched.

…ors from completion Two coupled cleanups in one commit since the outcome-handler rewrite naturally falls out of removing the client-side max-turns loop. The director graph used `maxTurns: turnCount + 1` to force "one director→agent cycle per request" — a hack expressing the single-round contract as a turn limit. This commit removes the hack and the surrounding user-facing setting, replacing implicit intent with explicit graph topology. Server: - `director-graph.ts`: drop `maxTurns` from state, drop the hard-cap check at the top of `directorNode`, change `agent_generate → director` edge to `agent_generate → END`. Single-round contract is now expressed by topology. - `lib/types/chat.ts`: drop `maxTurns` and `currentTurn` from `SessionConfig`; drop `maxTurns?` from `CreateSessionRequest.trigger`. Client: - `lib/store/settings.ts`: drop `maxTurns` state, `setMaxTurns` action, the `'10'` default, and the localStorage migration path. - `components/agent/agent-bar.tsx`: drop the entire compact-stepper UI block plus the unused MessageSquare/Minus/Plus lucide imports. - `components/settings/agent-settings.tsx`: drop the AgentSettings prop and the multi-agent-only Input block. - `lib/chat/agent-loop.ts`: drop `maxTurns` parameter, change `while (turnCount < maxTurns)` to `while (true)` controlled by inner exits, drop `'max_turns'` reason from `AgentLoopOutcome`. - `components/chat/use-chat-sessions.ts`: drop the maxTurns lookup, drop the trailing arg to runAgentLoop, drop vestigial `maxTurns: 0` / `currentTurn: 0` from session creation. - `eval/whiteboard-layout/runner.ts`: drop unused MAX_AGENT_TURNS plus the trailing arg. i18n: drop `settings.maxTurns` and `settings.maxTurnsDesc` from six locales (en-US, zh-CN, zh-TW, ja-JP, ar-SA, ru-RU). Motivation for removing the user-facing setting: the LLM director controls round length via cue_user / END decisions, so a numeric "max discussion turns" knob is redundant with rule-driven routing. When the director prompt was buggy, the cap masked the bug by force- ending; once director routing is fixed, the cap just removes user agency. Before, every non-`cue_user` outcome of the agent loop mapped to `status: 'completed'` in `use-chat-sessions.ts`. That conflated four distinct paths: - `end`: director chose END (legitimate completion) - `empty_turns`: two consecutive agents returned no text or actions (parser failure, model timeout, etc.) - `no_done`: SSE stream ended without a `done` event (infrastructure failure) - `aborted`: user pressed stop (handled elsewhere) The new switch routes `end` to completed, `empty_turns` and `no_done` through `clearLiveSessionAfterError` so the user sees an inline System message describing what went wrong instead of a quiet "completed". - `lib/types/chat.ts`: add `'error'` to `SessionStatus`. - `components/chat/use-chat-sessions.ts`: rewrite the outcome handler as a switch; have `clearLiveSessionAfterError` set `status: 'error'` so all entry points (the runAgentLoop outcome AND existing throw paths from fetchChat / SSE error events) converge on the same visible state. - `components/chat/session-list.tsx`: add `AlertCircle` red icon for the new error status. - i18n: add `chat.error.emptyAgentResponses` and `chat.error.streamInterrupted` strings to all six locales. This is the in-product fix that the prod investigation surfaced as an independent root cause of the #511 "discussion ended unexpectedly" symptom — separate from the director and agent prompt fixes in the earlier commits of this PR.

…nswering script Two minor CR follow-ups: - components/chat/use-chat-sessions.ts: the `settingsState` lookup at the top of runAgentLoopFn was the entry point for the maxTurns lookup ripped in the previous commit; with maxTurns gone, the binding is unused and ESLint flags it. Drop the line. (Other uses of useSettingsStore.getState() elsewhere in this file are unaffected.) - package.json: the answering eval runner had no convenience script, while the premature-end runner did. Add eval:orchestration:answering for symmetry and discoverability.

wyuc · 2026-05-27T15:47:41Z

CR loop summary (2 rounds, converged)

Ran 2 rounds of code review via the superpowers:requesting-code-review subagent on the full PR diff (origin/main..HEAD).

Round 1 flagged 2 Important issues:

components/chat/use-chat-sessions.ts:460 — dead settingsState binding (orphan from the maxTurns rip), tripping @typescript-eslint/no-unused-vars.
package.json — missing eval:orchestration:answering convenience script (asymmetric with eval:orchestration).

Both fixed in 9a90e84 (4 lines).

Round 2 verified:

ESLint clean on the touched file.
pnpm exec tsc --noEmit clean.
pnpm check (prettier) clean.
pnpm check:i18n-keys passes (7 locales, including pt-BR).
AgentLoopOutcome.reason union is exhaustively handled by the new outcome switch.
maxTurns / currentTurn / max_turns rip is complete; only remaining references are two explanatory comments in director-graph.ts.

Round 2 verdict: loop converged, no new findings.

Out-of-scope items deliberately left for follow-up (called out but not addressed in this PR):

Pre-existing directorNode error catch still routes to shouldEnd: true (i.e. completed rather than error) — independent path from this PR's outcome-handler upgrade.
AgentSettings component is orphaned on origin/main too; not introduced by this PR.
Retry-last-message affordance for the new 'error' session state — separate UX work.

wyuc mentioned this pull request May 26, 2026

Follow-up to #554: remaining items for #511 #598

Open

ashutoshrana mentioned this pull request May 26, 2026

fix(orchestration): add agent encoding note, tighten Rule 11, add edge-case tests [AI-assisted] #600

Closed

8 tasks

wyuc changed the title ~~test(orchestration): premature-END regression eval (follow-up to #554)~~ test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced) May 26, 2026

wyuc force-pushed the feat/598-eval-orchestration branch from 6cd1160 to ca403bc Compare May 27, 2026 02:47

wyuc changed the title ~~test(orchestration): probe #511 root causes — premature-END (refuted) + frustration handling (reproduced)~~ test(orchestration): premature-END regression eval + frustration-handling eval (#598 follow-up) May 27, 2026

wyuc force-pushed the feat/598-eval-orchestration branch from 135f659 to f5950ca Compare May 27, 2026 09:21

wyuc changed the title ~~test(orchestration): premature-END regression eval + frustration-handling eval (#598 follow-up)~~ test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up) May 27, 2026

wyuc force-pushed the feat/598-eval-orchestration branch from f5950ca to ca4ad98 Compare May 27, 2026 09:31

wyuc added 5 commits May 27, 2026 05:31

wyuc force-pushed the feat/598-eval-orchestration branch from ca4ad98 to f3fa658 Compare May 27, 2026 09:33

wyuc marked this pull request as ready for review May 27, 2026 15:47

wyuc requested a review from cosarah May 27, 2026 15:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up)#599

test+fix(orchestration): rip maxTurns; restore answer-the-user's-question behavior (#598 follow-up)#599
wyuc wants to merge 6 commits into
mainfrom
feat/598-eval-orchestration

wyuc commented May 26, 2026 •

edited

Loading

Uh oh!

wyuc commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wyuc commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's in this PR

Commit 1: premature-END regression guard

Commit 2: director question-answering eval

Commit 3: director rule 13 — Answering the User's Question

Commit 4: agent rule — Answering the User's Question

Commit 5: rip maxTurns end-to-end + distinguish errors from completion

Test plan

Uh oh!

wyuc commented May 27, 2026

CR loop summary (2 rounds, converged)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wyuc commented May 26, 2026 •

edited

Loading